Abstract

The Conservation of Energy plays a pivotal part in the develop- 
ment of the physical sciences. With the growth of computation and 
the study of other discrete token based systems such as the genome, 
it is useful to ask if there are conservation principles which apply to 
such systems and what kind of functional behaviour they imply for 
such systems. 

Here I propose that the Conservation of Hartley-Shannon Informa- 
tion plays the same over-arching role in discrete token based systems 
as the Conservation of Energy does in physical systems. I will go on 
to prove that this implies power-law behaviour in component sizes in 
software systems no matter what they do or how they were built, and 
also implies the constancy of average gene length in biological systems 
as reported, for example, by [23].

These propositions are supported by very large amounts of exper- 
imental data extending the first presentation of these ideas in 
1 Preliminaries



1.1 Conservation of Energy 

The Conservation of Energy is one of a few principles which are at the very 
heart of all physical systems. The principle has been modified over the years 
notably to take account of the 4-vectors of relativity and mass-equivalence 
but it remains pivotal. In the same year as Einstein's landmark paper on
general relativity (1915), Emmy Noether proved a remarkable theorem which,
amongst many other things, shows that the principle of conservation of energy
is a consequence of invariance under time translations.

The study of discrete systems is much younger and has only come of age 
in the digital age where we now routinely write millions of lines of source 
code to analyse terabytes of digital data. It is of great interest to see if there 
are similarly fundamental principles which apply to the evolution of discrete 
systems. 

This paper identifies such a principle and produces very large amounts of 
supporting evidence. The paper brings together various concepts which will 
be individually described briefly now. 

1.2 Power-laws

Power-law behaviour can be represented by the pdf (probability density
function) p(s) of entities of a certain size s appearing in some process,
being given by a relationship like:

        p(s) = \frac{k}{s^{\alpha}}        (1)

where k is a constant, which on a log p - log s scale is a straight line
with negative slope -\alpha. It can easily be verified that the equivalent
cdf (cumulative distribution function) c(s), derived by integrating (1),
also obeys a power-law (for \alpha \neq 1). For noisy data, the cdf form is
most often used because of its fundamental property of reducing noise, as
noted by [16], and it is this form which will be used here.
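The relationship between the pdf and its integrated form can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes an illustrative exponent alpha = 2.5, a lower cut-off s_min = 1 and the tail form of the cdf (the integral of p from s to infinity), and all function names are hypothetical.

```python
import math

def pdf(s, k, alpha):
    """Power-law pdf p(s) = k * s**(-alpha), as in equation (1)."""
    return k * s ** (-alpha)

def tail_cdf(s, k, alpha):
    """Closed-form integral of the pdf from s to infinity (needs alpha > 1)."""
    return k * s ** (1.0 - alpha) / (alpha - 1.0)

def loglog_slope(f, s1, s2):
    """Slope of f between s1 and s2 on a log-log scale."""
    return (math.log(f(s2)) - math.log(f(s1))) / (math.log(s2) - math.log(s1))

alpha = 2.5   # illustrative exponent (assumption)
s_min = 1.0   # illustrative lower cut-off (assumption)
k = (alpha - 1.0) * s_min ** (alpha - 1.0)  # normalises the pdf on [s_min, inf)

pdf_slope = loglog_slope(lambda s: pdf(s, k, alpha), 10.0, 1000.0)
cdf_slope = loglog_slope(lambda s: tail_cdf(s, k, alpha), 10.0, 1000.0)

print("pdf slope on log-log scale:", pdf_slope)
print("cdf slope on log-log scale:", cdf_slope)
```

Under these assumptions the integrated form is again a straight line on a log-log scale, with a slope one unit shallower than that of the pdf.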

Power-law behaviour has been studied in a very wide variety of environ- 
ments, see for example [23] (linguistics), [TS] (economic systems) and the 
recent excellent reviews by [13] and [16]. In software systems there has been
significant activity, much of it recent: [5], [12], [15], [3], [8], [17], [2],
[6] and [10] all discuss power-law behaviour but in rather different contexts.

Mitzenmacher [12] considered the distributions of file sizes in general filing
systems and observed that such file sizes were typically distributed with a
lognormal body and a Pareto (i.e. power-law) tail.

In comparison, Gorshenev and Pis'mak [8] studied the version control
records of a number of open source systems with particular reference to the
number of lines added and deleted at each revision cycle.

In this paper, I will step back from this and look for more fundamental 
reasons why power-laws are so ubiquitous. 

1.3 Systems of discrete choices 

A system based on discrete choices is any system which is built from
discrete pieces based on some available set of choices. The genome is a
perfect example. This is an exceptionally complex system which has evolved
over hundreds of millions of years, astonishingly from a set of only four
choices, the four bases of the genetic alphabet: adenine, cytosine, guanine
and thymine (acgt). The human genome comprises some 3 billion such bases. I
will refer to such a set of choices as an alphabet.

As an example, the first 60 bases of the measles virus^1 are

atggactcgc tatctgtcaa ccagatcttg taccccgaag ttcacctaga tagcccgata 

Computational science provides many more examples. In computational 
science, the source code of every computer program written by every re- 
searcher in pursuit of their computational results uses one or more program- 
ming languages. Such programs form an essential part of the vast majority 
of modern scientific work and parenthetically present a huge challenge to
scientific reproducibility [7].

The individual bases or alphabet of a programming language are called
tokens and may take two forms: the fixed tokens of the language as provided
by the language designers and the variable tokens. Fixed tokens include (in
the languages C and C++ for example) if, else, while, {, }. These cannot
be changed; the programmer can only choose to use them or not. Variable
tokens, with some small lexical restrictions, can be arbitrarily invented by 
the scientific programmer whilst constructing their program. These might be 
identifier names such as numberOfCandidateCollisions or lengthOfGene or 
constants such as 3.14159265. Computational scientific systems and indeed 
every other form of software system evolve from such tokens. There are many 
programming languages but all obey the same principles. 

Such programs are often very large. The software deployed in the search
for the recently discovered Higgs boson comprises around 4 million lines of
code [19]. At an average of around 6 tokens per line of code, this corresponds
to some 20-25 million tokens, although this is still less than 1% of the human
genome. The largest systems in use today appear to be around 100 million
lines of source code [13], perhaps 15% of the number of tokens of the human
genome.

1 http://www.ncbi.nlm.nih.gov/nuccore/HM562900.1

As an example, consider the following simple^2 bubblesort algorithm written
in C, for example [20].

    /* Sorts a[1..N] in place; a[0] is unused. */
    void bubble( int a[], int N )
    {
        int i, j, t;

        for( i = N; i >= 1; i-- )
        {
            for( j = 2; j <= i; j++ )
            {
                if ( a[j-1] > a[j] )
                {
                    t = a[j-1]; a[j-1] = a[j]; a[j] = t;
                }
            }
        }
    }

This algorithm contains 94 tokens in all based on 18 of the fixed tokens 
of ISO C 

void int ( ) [ ] { , ; for = >= -- <= ++ if > -

and the 8 variable tokens (i.e. invented by the programmer) 

bubble a N i j t 1 2 

Although programming languages have a much richer alphabet of tokens 
than genes, they obey the same principles - some external process chooses 
tokens from the available alphabet. I will argue that this process is driven 
by a beautiful underlying clockwork, that of Conservation of Information. 
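The token counts quoted for the bubblesort example can be reproduced with a few lines of code. The following is a minimal sketch and not the tokeniser used for the results in this paper: the regular expression, the keyword set and the function names are assumptions that only cover the tokens occurring in this one listing, with the loop increment and decrement written out in full.

```python
import re

# The bubblesort listing from the text.
SOURCE = """
void bubble( int a[], int N )
{
    int i, j, t;
    for( i = N; i >= 1; i-- )
    {
        for( j = 2; j <= i; j++ )
        {
            if ( a[j-1] > a[j] )
            {
                t = a[j-1]; a[j-1] = a[j]; a[j] = t;
            }
        }
    }
}
"""

# Assumption: only the ISO C keywords that occur in this listing are recognised.
KEYWORDS = {"void", "int", "for", "if"}

# Identifiers and keywords, integer constants, two-character operators,
# then single-character operators and punctuation.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|>=|<=|\+\+|--|[()\[\]{},;=<>+\-]")

def tokenise(source):
    """Return the list of tokens in source, in order of appearance."""
    return TOKEN_RE.findall(source)

def is_variable(token):
    """Variable tokens are invented identifiers and numeric constants."""
    if token[0].isdigit():
        return True
    return (token[0].isalpha() or token[0] == "_") and token not in KEYWORDS

tokens = tokenise(SOURCE)
variable_alphabet = {t for t in tokens if is_variable(t)}
fixed_alphabet = {t for t in tokens if not is_variable(t)}

print("total tokens:", len(tokens))
print("fixed alphabet:", sorted(fixed_alphabet))
print("variable alphabet:", sorted(variable_alphabet))
```

With these conventions the sketch finds 94 tokens and 8 variable tokens. It reports 19 distinct fixed tokens because it counts the closing brace separately, whereas the list in the text shows only the opening brace.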

2 Simple in the sense that nobody ever uses it because far faster sorting
mechanisms are known, but it is useful for teaching purposes.

Figure 1: A binary tree with leaves labelled 1 to 16. At each level
proceeding down, the path can go either left or right. There are four levels
leading down to one of 2^4 = 16 possibilities. Only 4 choices are needed to
reach any of the possibilities. We note that log_2(16) = 4. Here the number 7
has been singled out by the choices left, right, left, right.

1.4 Information theory 

Information theory has its roots in the work of Hartley [9], who showed that
a message of N signs (i.e. tokens) chosen from an alphabet or code book of S
signs has S^N possibilities and that the quantity of information is most
reasonably defined as the logarithm of the number of possibilities or
choices. To gain a little insight into why the logarithm makes sense,
consider Figure 1. The number of choices necessary to reach any of the 16
possible targets is the number of levels, which is log_2(number of
possibilities). The base of the logarithm is not important here.
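Hartley's definition is easy to evaluate directly. The following is a minimal sketch in which the function name hartley_information is hypothetical; it applies the definition to the binary tree of Figure 1 and, as a further illustration, to the 60 bases of the measles sequence quoted earlier.

```python
import math

def hartley_information(n_signs, alphabet_size, base=2.0):
    """Logarithm of the number of possible messages S**N, i.e. N * log(S)."""
    return n_signs * math.log(alphabet_size) / math.log(base)

# Figure 1: four binary choices single out one of 2**4 = 16 targets.
info_tree = hartley_information(4, 2)

# 60 bases drawn from the four-letter genetic alphabet.
info_bases = hartley_information(60, 4)

# Changing the base of the logarithm only rescales the result.
info_bases_nat = hartley_information(60, 4, base=math.e)

print("binary tree:", info_tree, "bits")
print("60 bases:", info_bases, "bits")
print("ratio natural/base-2:", info_bases_nat / info_bases)
```

The final ratio is log(2) whatever message is chosen, which is the sense in which the base of the logarithm is unimportant.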

Information theory was developed very substantially by the pioneering
work of Shannon [21], [22]. However, it is important not to conflate
information content with functionality or meaning, and Cherry [4] specifically
cautions
against this noting that the concept of information based on alphabets as ex- 
tended by Shannon and Wiener amongst others, only relates to the symbols 
themselves and not their meaning. Indeed, Hartley in his original work, de- 
fined information as the successive selection of signs, rejecting all meaning 
as a mere subjective factor. In the sense used here therefore, Conservation 
of Information will be synonymous with Conservation of Choice, not mean- 
ing. This turns out to be enough to predict important system properties. In 
other words, those properties depend only on the alphabet and not on what 
combining tokens of the alphabet might mean in any human sense. 
2 A statistical mechanical model of a discrete tokenised system

Armed with these pieces of information, I will now describe a variational
model in which Conservation of Information is a fundamental constraint, as
first described by [11]. A general discrete tokenised system will be
considered here as T tokens distributed in some way amongst M non-nested
components, each containing t_i tokens where i = 1, ..., M. In software
systems, a component might be a subroutine (Fortran), function (C) or
procedure (Tcl-Tk). In OO languages, it might be a method^3. In a genetic
system, a component might be a gene.

Then the number of ways of organising this system is given by:

        W = \frac{T!}{t_1! \, t_2! \cdots t_M!}        (2)

where

        T = \sum_{i=1}^{M} t_i        (3)

Also suppose there is some externally imposed entity \epsilon_i associated
with each token of component i, whose total amount is given by

        U = \sum_{i=1}^{M} t_i \epsilon_i        (4)

Using the method of Lagrangian multipliers as described in [10], the most
likely distribution satisfying equation (2) subject to the constraints in
equations (3) and (4) will be found. This is equivalent to maximising the
following variational derived by taking the log of (2).



        \log W = T \log T - \sum_{i=1}^{M} t_i \log(t_i)
                 + \lambda \{ T - \sum_{i=1}^{M} t_i \}
                 + \beta \{ U - \sum_{i=1}^{M} t_i \epsilon_i \}        (5)

where \lambda and \beta are the multipliers and log is the natural logarithm.
Setting \delta(\log W) = 0 and using the assumption that T and the t_i are
both \gg 1 leads to

        0 = - \sum_{i=1}^{M} \delta t_i \{ \log(t_i) + \alpha + \beta \epsilon_i \}        (6)

where \alpha = 1 + \lambda. This must be true for all variations \delta t_i
and so

        \log(t_i) = -\alpha - \beta \epsilon_i        (7)

3 Strictly speaking, methods can be and usually are nested in OO systems,
although when compiled they are simply treated as a function with some
context and so remain relevant to this model of M non-nested components.
Proper source analysis requires them to be treated the same way.

Using equation (3) to replace \alpha, this can be manipulated into the most
likely, i.e. the equilibrium, distribution

        t_i = \frac{T e^{-\beta \epsilon_i}}{\sum_{j=1}^{M} e^{-\beta \epsilon_j}}        (8)

Defining p_i = t_i / T, equation (8) then yields

        p_i = \frac{e^{-\beta \epsilon_i}}{\sum_{j=1}^{M} e^{-\beta \epsilon_j}}        (9)

Following [18] and referring to equation (9), p_i can then be interpreted
as the probability that a component of size t_i tokens is found, and it is
exponentially related to \epsilon_i. The larger \epsilon_i, for example, the
less likely such a component is to appear.
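The claim that the exponential form is the most likely distribution can be checked numerically on a toy system. The following is a minimal sketch under stated assumptions: three components with illustrative values \epsilon = (1, 2, 3), T = 3000 and \beta = 0.7, none of which come from the data in this paper, and hypothetical function names. It perturbs the exponential distribution along a direction which leaves both the total number of tokens and the total U unchanged, and compares the Stirling form of log W before and after.

```python
import math

def equilibrium(total, eps, beta):
    """Most likely distribution t_i = T exp(-beta eps_i) / sum_j exp(-beta eps_j)."""
    weights = [math.exp(-beta * e) for e in eps]
    q = sum(weights)
    return [total * w / q for w in weights]

def log_w(t):
    """Stirling approximation log W = T log T - sum_i t_i log t_i."""
    total = sum(t)
    return total * math.log(total) - sum(x * math.log(x) for x in t)

eps = [1.0, 2.0, 3.0]    # illustrative values (assumption)
T, beta = 3000.0, 0.7    # illustrative values (assumption)
t_star = equilibrium(T, eps, beta)

# This direction satisfies sum(d) = 0 and sum(eps * d) = 0, so moving along
# it keeps both constraints (fixed T and fixed U) intact.
direction = [1.0, -2.0, 1.0]

def perturbed(h):
    """Shift t_star by h along the constraint-preserving direction."""
    return [x + h * d for x, d in zip(t_star, direction)]

changes = [log_w(perturbed(h)) - log_w(t_star) for h in (-5.0, 5.0)]
print("equilibrium distribution:", t_star)
print("change in log W for each perturbation:", changes)
```

Both changes are negative: any admissible move away from the exponential distribution reduces the number of ways W of organising the system.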

Hinting at what is to come, we can see immediately that in any discrete
system where \epsilon_i is the logarithm of some quantity, the resulting size
distribution is overwhelmingly likely to be power-law, since
exp(-c log d) = d^{-c}.
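This step can be illustrated directly. The following is a minimal sketch, assuming an illustrative set of quantities d_i and an illustrative multiplier \beta = 2; the function name is hypothetical. It evaluates the exponential distribution with \epsilon_i = log d_i and compares it with a normalised power-law in d_i.

```python
import math

def boltzmann_probabilities(eps, beta):
    """p_i = exp(-beta eps_i) / sum_j exp(-beta eps_j)."""
    weights = [math.exp(-beta * e) for e in eps]
    q = sum(weights)
    return [w / q for w in weights]

d_values = [10, 20, 40, 80, 160]   # illustrative quantities (assumption)
beta = 2.0                         # illustrative multiplier (assumption)

# Exponential form with eps_i = log(d_i).
p_exp = boltzmann_probabilities([math.log(d) for d in d_values], beta)

# Direct power-law form d_i**(-beta), normalised the same way.
norm = sum(d ** (-beta) for d in d_values)
p_pow = [d ** (-beta) / norm for d in d_values]

# Log-log slope between neighbouring points.
slopes = [
    (math.log(p_exp[i + 1]) - math.log(p_exp[i]))
    / (math.log(d_values[i + 1]) - math.log(d_values[i]))
    for i in range(len(d_values) - 1)
]

print("exponential form:", p_exp)
print("power-law form:  ", p_pow)
print("log-log slopes:  ", slopes)
```

The two forms agree and every log-log slope equals -\beta, so a logarithmic \epsilon_i turns the exponential distribution into a power-law.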



2.1 Merging with information theory 

I will now repeat the argument I gave in [11]. Suppose then that the unique
alphabet of the i-th component contains a_i tokens and, as defined above,
t_i tokens in all. The number of ways of arranging the tokens of this
alphabet in component i is therefore a_i^{t_i}. Following Hartley, the
quantity of information in component i will therefore be defined as

        I_i = \log(a_i^{t_i}) = t_i \log a_i        (10)

To blend this into the variational method shown earlier, the same
computational device used in [10] was used. I will introduce I, the total
amount of information, as the sum of the information in each component as
follows:

        I = \sum_{i=1}^{M} I_i = \sum_{i=1}^{M} t_i \left( \frac{I_i}{t_i} \right)        (11)

I will now identify \epsilon_i with (I_i / t_i) in equation (8). In other
words, each token of component i has an externally imposed information
density associated with it given by (I_i / t_i). This assumes that the
information per token in a single component takes some average value but that
this can vary amongst components. This seems reasonable in that it suggests
that no particular token is any more important than any other when developing
a particular component as some functional entity. However, it allows for the
fact that this can vary amongst different components, which fits well with
the intuition that some functional entities are in some sense more
complicated than others. Note finally that introducing this additional
functional dependence of \epsilon_i on t_i does not disrupt the development
which led to equation (9), as \epsilon_i is fixed externally by assumption.

Equation (9) can then be written as

        p_i = \frac{e^{-\beta I_i / t_i}}{Q(\beta)}        (12)

where

        Q(\beta) = \sum_{i=1}^{M} e^{-\beta I_i / t_i}        (13)

Combining equations (12) and (10) then gives

        p_i = \frac{e^{-\beta \log a_i}}{Q(\beta)}        (14)

So we finish up with the following predicted power-law distribution

        p_i \sim (a_i)^{-\beta}        (15)

subject to the twin constraints that the total number of tokens T is fixed

        T = \sum_{i=1}^{M} t_i        (16)

and the total Hartley/Shannon information content I is also fixed

        I = \sum_{i=1}^{M} I_i        (17)

where I_i is the information content of the i-th component.
It is worth repeating that this overall process does not care about the
tokens themselves - all individual microstates are equally likely. It simply
says that if total size and choice in the Hartley-Shannon sense are conserved
during the process of distributing the tokens, then a power-law distribution
of component size in the unique alphabet of tokens used is overwhelmingly
likely to emerge, since it occupies the vast majority of the microstates.

Given its central position in what follows, it is useful to retrace the as- 
sumptions. These are 

1. The variational method assumes that both t_i and T are \gg 1. This
turns out to be a very good approximation for nearly all the data here.
Components with t_i < 10 are rare indeed in software systems because
of the token overhead described in Section 3.1, and genes are typically
much longer.

2. The variational method enforces that the total size T is kept constant
whilst the most likely solution is found. It should be noted that this is
not its actual size at any point in time, but the eventual size defined
by its intended functionality in an ergodic sense. In other words, if
the same system were produced many times independently, then for a
particular T, the variational method finds the most likely distribution
of t_i subject to the constraints.

3. The variational method also enforces the Conservation of Information.
I is not the same as functionality; it is simply related to choice from
the available alphabet in the Hartley-Shannon sense.

3 Application to software systems 
3.1 Software components and tokens 

Following the earlier discussion, we can write the unique alphabet of a_i
tokens in the i-th component of a software system of M components as

        a_i = a_f + a_v(i)        (18)

where a_f is the alphabet of fixed tokens and a_v(i) is the alphabet of
invented tokens, which is clearly dependent on i since programmers are free
to create them as and when desired.

It will be noted that a_f is taken as independent of component whereas
a_v(i) is dependent on the component. To flesh this out a little, it is
worthwhile introducing a highly relevant property of programming languages at
this point. Smaller components tend to be dominated by tokens fixed by the
programming language and larger components tend to be dominated by tokens
invented by the programmer, for example constants and identifier names. The
reasons for this are, first, that the fixed tokens of a language are limited
in number and a significant number of these are very rarely used (for
example, the nine trigraphs or the goto in ISO C). Second, there is a certain
token overhead which must be paid in order to produce the simplest of
components. As the component size grows, the fixed token alphabet rapidly
stabilises whilst the invented token alphabet grows without any such limits.
It is therefore a reasonable assumption to consider the alphabet of fixed
tokens as approximately constant across components.

To support this conclusion, throughout these studies the (variable/fixed)
token ratio was found to be typically 0.3 or less for the small components.
In contrast, the same token ratio is typically greater than 5 for large
components. In addition, on average, the fixed token population does not vary
with component size - linear regression of a_f against t_i on the 526,158
components extracted in this study revealed a gradient of around
7.0 x 10^{-4}; in other words, it is effectively zero.

In other words, as the component size grows, the fixed token alphabet
hardly changes in this dataset whilst the variable token alphabet grows
without any such limits. For example, more than 95% of the components
analysed here used fewer than 30 fixed tokens.
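The regression of a_f against t_i is an ordinary least-squares fit. The following is a minimal sketch of how such a gradient could be computed. The data are synthetic and purely hypothetical (a fixed alphabet which scatters around 20 tokens regardless of component size); they are not the 526,158 components of this study, and the function name is an assumption.

```python
import random

def ols_slope(xs, ys):
    """Gradient of the least-squares straight line of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

random.seed(0)

# Synthetic component sizes t_i in tokens (hypothetical data).
sizes = [random.randint(50, 5000) for _ in range(2000)]

# Synthetic fixed-token alphabet a_f: saturated near 20, independent of size.
fixed_alphabet = [20 + random.randint(-3, 3) for _ in sizes]

gradient = ols_slope(sizes, fixed_alphabet)
print("gradient of a_f against t_i:", gradient)
```

For data of this kind the fitted gradient is close to zero, which is the behaviour reported above for the real components.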

This has a profound effect on the predicted shape of the distribution as 
we will see. 



3.2 Predicted shape of the size distribution 

In anticipation of applying this to software data, it is conventional to
consider the cdf (cumulative distribution function) as discussed earlier,
rather than the pdf, because of its much more stable behaviour in the
presence of noise [16]. This is given by integration of (15) as

        c_i \sim a_i^{-\beta + 1}        (19)

for \beta \neq 1. It is then possible to anticipate the approximate predicted
shape of the size distribution as follows. For small components, we have seen
that it is reasonable to assume that the number of fixed tokens will tend to
dominate the total number of tokens because of the fixed token overhead. In
other words, a_f \gg a_v(i). Equation (19) can then be written

        c_i \sim (a_f)^{-\beta + 1} \left( 1 + \frac{a_v(i)}{a_f} \right)^{-\beta + 1}        (20)

a f 
In other words,

        c_i \sim (a_f)^{-\beta + 1}        (21)

which implies that c_i will tend to a constant for small components on
a log-log plot. For large components, using the same arguments, the general
rule applies

        c_i \sim a_i^{-\beta + 1}        (22)

The generic shape of the resulting predicted curve on a log-log scale is
shown in Figure 2.

[Figure 2 plot: log c against log a]

Figure 2: The predicted cdf using the model described in this paper. The cdf
is predicted to be approximately constant for small components and power-law
in a_i for large ones, with a merging zone between.
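The two regimes can be seen by evaluating the local log-log slope of the predicted cdf. The following is a minimal sketch, assuming illustrative values a_f = 20 and \beta = 3 and treating the invented alphabet a_v as the quantity which grows with component size; the function names are hypothetical.

```python
import math

def predicted_cdf(a_v, a_f, beta):
    """Unnormalised cdf c ~ (a_f + a_v)**(1 - beta), i.e. a_i = a_f + a_v."""
    return (a_f + a_v) ** (1.0 - beta)

def local_slope(a_v, a_f, beta, factor=1.01):
    """Finite-difference slope of log c against log a_v."""
    c1 = predicted_cdf(a_v, a_f, beta)
    c2 = predicted_cdf(a_v * factor, a_f, beta)
    return (math.log(c2) - math.log(c1)) / math.log(factor)

a_f, beta = 20.0, 3.0   # illustrative values (assumption)

slope_small = local_slope(0.1, a_f, beta)       # a_v much smaller than a_f
slope_large = local_slope(10000.0, a_f, beta)   # a_v much larger than a_f

print("slope for small components:", slope_small)
print("slope for large components:", slope_large)
```

The slope is close to zero where the fixed alphabet dominates and approaches 1 - \beta where the invented alphabet dominates, with a smooth transition between the two.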

3.3 Experimental verification 

An unusually large number of systems were analysed across multiple
languages in order to increase the statistical relevance. Open source has had
many benefits but one of particular value to researchers is the enormous
amount of source code which can be freely downloaded, often with excellent
development history. In this study, 6 languages were chosen: Ada, C, C++,
Fortran, Java and Tcl. This covered a very wide variety of implementation
areas and paradigms. In these languages, around 90 packages were downloaded
comprising some 55.5 million lines of code over many development areas and
almost half a billion tokens in all. These include, for example, the whole of
the Linux kernel, PHP, X11R7, Postgresql and Perl (C), the Ada validation
system (Ada), the KDE desktop (C++) and the Java Virtual Machine (Java). As
well as these, the author had access to some commercial systems in Tcl,
Fortran and C but these only totalled around 2% of the total code analysed.
Individual package sizes spanned 3 orders of magnitude, varying from around
10 KSLOC (thousands of source lines of code) to some 13.5 MSLOC (millions of
source lines of code) as is found in version 2.6.27 of the Linux kernel.



3.3.1 Lexical analysis 

The extraction of tokens from a program is not a trivial process and requires
the development of tools which mimic the front-end of compilers [1]. The
minimum requirements for a lexical analyser for each language considered
here were

• The ability to extract tokens and to distinguish between the two token
forms, fixed and variable.

• The ability to recognise the start and end of a component. This is
simpler for non-OO languages than OO languages because the latter
admit nested components or methods. In this analysis a useful
approximation is that nested methods can be ignored. This has a small
effect as noted later.



The resulting generic tokeniser was written in C for optimal performance
and also to exploit the well-known lex tool for generating lexical
analysers^4. It comprises around 2000 lines of C and 1300 lines of lex^5.
There may be certain difficult tokens in some languages which are simply
ignored by this generic analyser. This excluded only a tiny fraction of
components from the analysis however. As a quality control check, the C and
Fortran analysers were checked against and found to agree closely with
existing full parsers written by the author some years ago, both of which
parsed the relevant compiler validation suites correctly (FIPS160 in the case
of ISO C90 and the ACVS in the case of Fortran 77). The resulting generic
tokeniser is extremely fast and can extract tokens at a rate of around
100,000 lines a second on a typical Linux desktop, allowing the analysis of
the very large amounts of source code considered here.

4 http://flex.sourceforge.net/manual/

5 For the purposes of open verification, it is included at 
http://www.leshatton.org/category/scientific-writing/datasets/. 
[Figure 3 plot: "55.5 million lines of Ada, C, C++, Fortran, Java, Tcl";
x-axis: size x (tokens), logarithmic from 1 to 1000]

Figure 3: The measured cdf for all 90 systems analysed here combined into
one super-system. This comprises around 13% Java, 6% C++, 8% Fortran,
Ada and Tcl combined and around 73% C. This very roughly reflects the
amount of each language freely available on open source, although 24 million
lines came from Linux and BSD which are in C. Absenting these unusually
large systems, the percentages are 22%, 10%, 15%, 53% respectively.

3.3.2 Results 

Using the generic lexical analyser described in the previous section, all 90
or so systems were analysed, comprising some 55.5 million lines of code in
six languages. To emphasise that the nature of the tokens or their meaning
does not really matter ergodically, Figure 3 shows the measured cdf for the
whole dataset together, comprising almost half a billion tokens.

This can be compared with the prediction represented by Figure 2. The
two are emphatically alike. The predicted linearity of the power-law tail of
Figure 3 was subjected to a standard test for significance using the linear
modelling function lm() in the widely-used R statistical package^6. This
reported a very high degree of linearity with a linear-fit correlation of
0.998 between token counts of 30 and 3000, a span of two decades. The same
analysis reports a slope of -2.125 ± 0.003, which is in the range -2 to -3
reported for most natural phenomena by [16]. The associated p-value (the
probability of finding a dataset more unlikely than this one by chance) is
< 2.2e-16, an extremely emphatic result. The corresponding output from
R is shown below.

    > source("plot.tail.R")
    Read 957 items
    Read 957 items
    > summary(fm)

    Call:
    lm(formula = y ~ x, data = universe)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.25510 -0.05790 -0.02892  0.05866  0.33697

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 20.285185   0.019336  1049.1   <2e-16 ***
    x           -2.125199   0.003133  -678.4   <2e-16 ***

    Residual standard error: 0.08646 on 955 degrees of freedom
    Multiple R-squared: 0.9979, Adjusted R-squared: 0.9979
    F-statistic: 4.603e+05 on 1 and 955 DF, p-value: < 2.2e-16

6 http://www.r-project.org/
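For readers without R, the same kind of straight-line fit on a log-log scale can be sketched in a few lines. The following is a minimal sketch on synthetic data only: the points are generated from a line with slope -2.125 and intercept 20.285 plus Gaussian noise, using natural logarithms of sizes between 30 and 3000 as an assumption, so it illustrates the fitting procedure and does not reproduce the measured dataset. The function name fit_line is hypothetical.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; also returns R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy * sxy / (sxx * syy)
    return slope, intercept, r_squared

random.seed(1)

# Synthetic log-sizes spanning two decades (assumption: natural logarithms).
n_points = 957
lo, hi = math.log(30.0), math.log(3000.0)
xs = [lo + (hi - lo) * i / (n_points - 1) for i in range(n_points)]

# Synthetic log-cdf values scattered about a known straight line.
TRUE_SLOPE, TRUE_INTERCEPT, NOISE_SD = -2.125, 20.285, 0.09
ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + random.gauss(0.0, NOISE_SD) for x in xs]

slope, intercept, r_squared = fit_line(xs, ys)
print("slope:", slope)
print("intercept:", intercept)
print("R-squared:", r_squared)
```

The fit recovers the slope used to generate the data to within the noise, and the R-squared value plays the role of the "Multiple R-squared" line in the R output above.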

The qualitative prediction of the asymptotic constant behaviour of the
cdf for small components is also reassuring. It can be concluded that this
experiment very strongly supports the model presented in [11] and repeated
here as (15). The resulting behaviour implicit in Figure 3 contrasts nicely
with the pure straight line predicted for monkeys pounding on keyboards as
eloquently described by [13]. The ergodic nature of (15) simply accumulates
all possible programmers pounding on keyboards. As will be seen, it also
works well with much smaller numbers, i.e. individual systems, a
characteristic of classical statistical mechanics.

3.3.3 Individual systems 

It was mentioned earlier that classical statistical mechanics results often
remain robust at smaller values of T. Figure 4 shows a collage of some of the
individual C systems in the range 500,000 - 1,100,000 lines of code, all of
which show good similarity with the generic model.

As can be seen by studying the animation at
http://www.leshatton.org/wp-content/uploads/2(
the generic shape of (15) appears fairly early on, certainly within the first
1% of the total data represented by Figure 3. To give some idea of medium
and small systems, Figures 5 and 6 show a collage of individual systems in
the ranges 150,000 - 400,000 lines of code and 8,000 - 90,000 lines of code
respectively.

[Figure 4 panels include: PHP 5.3.6 (C), MPlayer video editing (C)]

Figure 4: Large C applications in the range 500,000 - 1,100,000 SLOC.
Shown are the latest versions of each of PHP, MPlayer, GIMP, MySQL,
GTK, X11R7, TclTk and glib at the time of writing.

In contrast, Figure 7 is a collage of various languages of various system
sizes as detailed. Comparing the slight curvature in the tail of the large OO
packages in this Figure with the equally large C packages shown in Figure 4
reveals the effects of the approximation of neglecting nested small methods
as described earlier.

For each of Figures 4, 5, 6 and 7, the packages decrease in line count
from top left to bottom right. Note that there are also some small changes in
the y-axis scale.

3.3.4 Persistence 

Given that the signature of (15) is visible even in individual packages, it
is useful to consider the time of appearance of this behaviour. Is it present
in the first release of a software system or does it emerge as that system
is systematically refined during a maintenance cycle? The constrained
development model described in [11] suggests that the distribution of tokens
by programmers under these constraints as they reason about a system
unconsciously drives the development at all stages. In other words, it might
be expected that this characteristic signature would appear early on in the
development process.

In addition, practical considerations suggest that it would be unusual to 
expect major changes in component size distribution as a system ages on 
the general grounds that engineers are reluctant to change working systems 
too much even as they adapt to changing requirements and other normal 
maintenance activities. 

To address this, the revision history of three very different systems (one
in Fortran, one in Tcl-Tk and one in C) was analysed. Two of these were taken
from the very first release of 7-8 year life-cycles and one from around
halfway into its 25 year life-cycle from first release.



Relevant parameters of these three packages are shown as Table 1.

Package                 Language   Releases   Years   Start XLOC   End XLOC
Numerical library       Fortran           8      12       90,198    266,123
Geophysical modelling   Tcl-Tk           44       7        6,227     11,078
Language parser         C                27       8       35,851     65,270
[Plot panels: R 2.13 (C), ImageMagick 6.5.5, GIMP 2.6.11 (C),
Evolution 3.1.1 (C); x-axis: size x (tokens)]

Figure 5: Medium C applications in the range 150,000 - 400,000 SLOC.

[Plot panels: GGOBI graphics 2.1.8 (C), Epiphany 3.0.2 (C), Ingres
database 8.9 (C), gnome utils 3.0.1 (C), Numerical Recipes in C, GNAT Ada95
compiler, Polylib 5.04 (C), Compression library (C); x-axis: size x (tokens)]

Figure 6: Small C applications in the range 8,000 - 90,000 SLOC.
Figure 7: Various Ada, Fortran 77 and 90, C++, Java and Tcl-Tk applications
in the range 40,000 - 4,500,000 SLOC. The y-axis is slightly extended
compared with Figures 4, 5 and 6.



Table 1: Packages used to show power-law persistence 

The left hand upper diagram of Figure 8 shows the component size
distribution for each official release of a widely used numerical library
(the NAG Fortran library) from release 12 through release 19, spanning around
12 years. The last release analysed, release 19, comprised 3659 components
containing altogether almost 270,000 XLOC. Even though the library almost
trebled in size over this period, there is little substantial change in the
component size distribution across this time period. It remains possible that
substantial change might have taken place in the releases prior to release
12. Although the data were not available to confirm this, it would be most
unlikely for a scientific subroutine library to change significantly over its
life-cycle by the very nature of its functionality. The solutions of
mathematical algorithms hardly vary once implemented.

The right hand diagram of Figure 8 shows the component size distributions
for every fourth release of a Tcl-Tk system for geophysical modelling
as it grew by about a factor of two.

The left hand lower diagram shows every third release of a C system for
statically checking C programs as it grew by a factor of two from its first
release.

These are very different systems but the characteristic signature described
by (15) appears to be a persistent property in the evolution of each system,
present at first release and preserved during the maintenance cycle, even when
that doubles or triples the initial released system size as is the case here.

4 Application to genetic systems 

I will now apply the general principle expressed by (15) to predicting
properties of the genome, another discrete token based system, and in
particular that of gene length.

4.1 Genetic background 

The authors of [23] surveyed almost all prokaryotic and eukaryotic species 
whose complete genome sequence data were then available and well anno- 
tated. These data included 81 prokaryotes and 19 eukaryotes and regressed 
the estimate of total coding sequence length against the estimate of the num- 
ber of genes for each of the two groups of species. They found that although 
the average lengths of genes in prokaryotes and eukaryotes are significantly 
different, the average lengths of genes are effectively constant within either 
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Figure 8: Distributions of a Fortran, Tcl-Tk and a C system over many 
releases. 
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of the two kingdoms leading to a linear relationship between the total length 
of the sequence and the number of contributing genes. They concluded that 
natural selection has clearly set a strong limitation on gene elongation within 
the kingdom and that the average gene size adds another distinct character- 
istic for the discrimination between the two kingdoms of organisms. Their 
data can be seen at http : / / mbe.oxfordjournals.org/ content /23/6/ 1 107.full| 
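
The kind of regression described above can be sketched as follows. This is a minimal illustration only: the function name is hypothetical and the (gene count, total coding length) pairs are made-up numbers, not the data of [23]. If average gene length is constant within a group, the fitted slope estimates that average and the intercept should be close to zero.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical species: number of genes and total coding length in bases.
# These values are invented for illustration (exactly 950 bases per gene).
gene_counts = [500, 1000, 2000, 4000]
coding_lengths = [475_000, 950_000, 1_900_000, 3_800_000]

slope, intercept = fit_line(gene_counts, coding_lengths)
print(f"estimated mean gene length: {slope:.1f} bases")
print(f"intercept: {intercept:.1f} bases")
```

With real data the points scatter about the line, and the quality of the fit indicates how nearly constant the average gene length is within the group.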

Here, I propose that the high conservation of average gene length within
a kingdom is inevitable and follows from exactly the same underlying
Conservation of Choice, or Hartley-Shannon Information, so emphatically
demonstrated above for software systems.



5 A variational model for gene length 

First, suppose that genes have length $t_i$ bases chosen from a unique alphabet
of $a_i$ bases. The complications of defining gene length, which include taking
account of introns, outrons and so on, are avoided by using pre-analysed data
as for example in [23].

The key observation is that the alphabet of bases in genetic codes is fixed
to adenine, cytosine, guanine and thymine. In other words, $a_i = 4, \forall i$.
In this regard, although the genome is much bigger than any single software
system, it is constructed from a much simpler alphabet. This then implies
from (15) that for such genetic sequences,

$$p_i \sim K \tag{23}$$

where K is a constant. In other words, Conservation of Information im- 
plies that by far the most likely outcome is that gene lengths are distributed 
uniformly within whatever kingdom is being considered and so the average 
gene length E(L) is constant. Since there are M genes in a total coding length 
T, we have 

$$\text{Constant} = E(L) = C \, \frac{T}{M} \tag{24}$$

where C is some constant depending on the species. In other words

$$T = k' M \tag{25}$$

where k' is another constant. This behaviour is precisely that found
empirically by [23] as shown at http://mbe.oxfordjournals.org/content/23/6/1107.full



Note that this development says nothing about the kingdom. It simply 
says that subject to the total number of bases being constant and the total in- 
formation being constant in the Hartley-Shannon sense, the overwhelmingly 
most likely distribution of the lengths of the genes will be uniform leading
to the prediction that the average gene length is constant for a kingdom. 
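
The step from the power-law distribution to a constant probability can be written out explicitly. The lines below are a sketch that assumes the form quoted in the conclusions as (26), with $\beta$ the exponent of that distribution:

```latex
\begin{align*}
p_i &\sim (a_i)^{-\beta} \\
    &= 4^{-\beta} \qquad \text{since } a_i = 4 \ \ \forall i \\
    &\equiv K
\end{align*}
```

Because the right-hand side no longer depends on $i$, every gene length is equally likely under this model, which is the uniform distribution referred to above.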

It is worthwhile reiterating the points made by [23]. First of all, they
argue "it is widely accepted that natural selection favours shorter genic coding
sequence length for higher transcriptional efficiency, for efficient protein
synthesis, and for avoiding accumulation of deleterious mutation. On the other
hand however, evolution seems to improve the function of a protein through
elongating its coding sequence". They go on to say that "their observations
suggest that there is a stringent structural constraint on evolution of gene
size on a genomic scale. Furthermore species that have been diverged for
more than a few billions of years ago in either prokaryotic (Prochlorococcus
marinus) or eukaryotic (Ashbya gossypii) group share a relatively constant
mean gene size".

These comments suggest the existence of a more profound underlying 
principle at work. I propose that this underlying principle is indeed the
Conservation of Information. In software systems, this has been demonstrated
here to be overwhelmingly true whatever language is used, whatever the 
system does and howsoever the system was built. By analogy in genetic 
systems, this underlying clockwork is independent of the nature of evolu- 
tion. It merely populates the landscape which evolution traverses with an 
overwhelming number of places where average gene length is constant for a 
kingdom. 

As a result, it is worth reiterating Cherry's caution [I] against
over-emphasizing the relationship between information content and meaning, and
by inference, functionality. The development using information content from
equation (10) onwards leading to the relationship expressed by equation (15)
is fundamental but it says little if anything about functionality. Indeed
functionality seems irrelevant in the emergence of the properties described by
(15).

The proper study of meaning is known as semiotics. In this discipline, 
rules acting on signs or tokens are split into three categories:- 

• Syntactic rules (rules of syntax; relations between signs) 

• Semantic rules (relations between signs and the things, actions, rela- 
tionships and qualities known collectively as designata) 

• Pragmatic rules (relations between signs and their users)

The development described here relates only to the first category.

6 Conclusions



The paper presents several contributions each supported by real systems data 
of different provenance. 

• Using variational principles suggested in [10], [11] and using the principle
of the Conservation of Information, it is predicted that the probability
$p_i$ of a component appearing with $t_i$ tokens in any software system,
whatever its implementation details, obeys the following distribution
with respect to the size of its unique alphabet of tokens $a_i$,

$$p_i \sim (a_i)^{-\beta} \tag{26}$$

Overwhelming evidence for this behaviour has been presented derived
from some 55.5 million lines of code in six languages with an associated
p-value of $< 2.2 \times 10^{-16}$.

• The behaviour exemplified by (26) has been demonstrated to be persistent
through the life of single software systems as exemplified by three
very disparate systems.

• The behaviour exemplified by (26) appears with relatively few tokens
in software systems. In other words, equilibration is quite rapid.

• The underlying principle of Conservation of Information is shown to 
lead to a prediction that average gene length is constant in a kingdom. 
This is supported by independent data. 

In summary, the Conservation of Information in discrete token based sys- 
tems such as all software systems and biological systems such as the genome, 
appears to play a fundamental role in the development of those systems in 
a way comparable with the principle of Conservation of Energy in physical 
systems. 